Many things can make one’s heart smile with joy. HappyDB is “a corpus of 100,000 crowd-sourced happy moments”. In this project I carry out an exploratory data analysis of the HappyDB corpus and look deeper into the causes that make us happy.

Overall

## Observations: 99,564
## Variables: 18
## $ hmid                  <int> 27673, 27674, 27675, 27676, 27677, 27678...
## $ wid                   <int> 2053, 2, 1936, 206, 6227, 45, 195, 740, ...
## $ reflection_period     <fct> hours_24, hours_24, hours_24, hours_24, ...
## $ original_hm           <chr> "I went on a successful date with someon...
## $ cleaned_hm            <chr> "I went on a successful date with someon...
## $ modified              <chr> "True", "True", "True", "True", "True", ...
## $ num_sentence          <int> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1...
## $ ground_truth_category <chr> NA, NA, NA, "bonding", NA, "leisure", NA...
## $ predicted_category    <chr> "affection", "affection", "exercise", "b...
## $ id                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ text                  <chr> "successfully date sympathy connected", ...
## $ age                   <chr> "35", "29.0", "30", "28", "55", "23", "3...
## $ country               <chr> "USA", "IND", "USA", "DNK", "USA", "IND"...
## $ gender                <chr> "m", "m", "f", "f", "f", "m", "m", "m", ...
## $ marital               <chr> "single", "married", "married", "married...
## $ parenthood            <chr> "n", "y", "y", "n", "y", "n", "n", "n", ...
## $ count                 <int> 4, 3, 3, 6, 5, 2, 4, 4, 3, 5, 4, 7, 3, 3...
## $ num_words             <int> 14, 12, 10, 25, 10, 4, 13, 12, 7, 8, 12,...

Above, I analyze how often words appear when people describe their happy moments. Friend and family are among the most frequent words, which suggests that people’s happy moments mostly happen in situations connected with them. In the next part, I divide people into different groups by some attributes, to explore more specifically what makes people happy.
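The word-frequency step can be sketched in base R as follows. This is a minimal illustration on made-up strings standing in for the `text` column; the toy values and the variable names `moments`, `words` and `word_freq` are assumptions, not the code used for the analysis.

```r
# Toy stand-in for the cleaned `text` column (hypothetical values).
moments <- c("friend dinner family", "friend game", "family trip friend")

# Split each moment on whitespace and flatten into one vector of words.
words <- unlist(strsplit(moments, "\\s+"))

# Count occurrences and sort from most to least frequent.
word_freq <- sort(table(words), decreasing = TRUE)
head(word_freq, 3)
```

The sorted table can be passed directly to a bar chart or word cloud function.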

Different age stage

In this part, I focus on people’s happy moments at different age stages. Based on age, I divide people into 3 categories: teenager, adult and elder.
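One way to build such groups is `cut()`. The sketch below is only an illustration: the cut points (20 and 60) are assumed, since the exact thresholds are not stated here, and `age_num` and `age_stage` are hypothetical names. The `age` column is stored as text (for example "29.0"), so it is converted to numbers first.

```r
# Ages as they appear in the data: character strings, some with a decimal part.
age <- c("35", "29.0", "17", "55", "23", "70")

# Convert to numeric; entries that are not numbers would become NA.
age_num <- suppressWarnings(as.numeric(age))

# Assumed cut points (hypothetical): under 20 = teenager, 20 to 59 = adult, 60+ = elder.
age_stage <- cut(age_num,
                 breaks = c(-Inf, 20, 60, Inf),
                 labels = c("teenager", "adult", "elder"),
                 right = FALSE)
table(age_stage)
```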

A glimpse of age stage.

The above plot shows the number of words people in different age stages used to describe their happy moments. Adults use the most words overall, which may be because there are more adults in the dataset. The average number of words per moment differs slightly between age stages: older people tend to use more words.
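The difference between total and average word counts can be made explicit with `tapply()`. The data frame below is a toy example with invented values; only the column name `num_words` comes from the dataset, and `age_stage` is an assumed name for the grouping column.

```r
# Toy data (hypothetical): one row per happy moment.
df <- data.frame(
  age_stage = c("teenager", "adult", "adult", "elder", "adult"),
  num_words = c(10, 14, 12, 25, 13)
)

# Total words per group: grows with the number of moments in the group.
total_words <- tapply(df$num_words, df$age_stage, sum)

# Average words per moment: comparable across groups of different size.
mean_words <- tapply(df$num_words, df$age_stage, mean)

total_words
mean_words
```

In this toy example adults have the largest total simply because they contribute the most rows, while the average tells a different story.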

The above plot shows the proportion of each category in the happy moments of people in different age stages. The proportions are basically the same across stages, and achievement contributes a lot to people’s happy moments.
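Category proportions within each group can be computed with `table()` and `prop.table()`. This is a sketch on invented rows; `predicted_category` is a column of the dataset, while `age_stage` and the values shown are assumptions.

```r
# Toy data (hypothetical): one row per happy moment.
df <- data.frame(
  age_stage = c("teenager", "teenager", "adult", "adult", "adult", "elder"),
  predicted_category = c("bonding", "achievement", "achievement",
                         "affection", "achievement", "affection")
)

# Cross-tabulate group by category.
counts <- table(df$age_stage, df$predicted_category)

# margin = 1 normalises each row, so every group's proportions sum to 1.
props <- prop.table(counts, margin = 1)
round(props, 2)
```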

Analyzing word and document frequency: tf-idf

The tf-idf plot. It shows the words that are important for, but not too common across, the happy moments of people in different age stages.
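To make the idea concrete, here is a small base R sketch of tf-idf that treats each age stage as one document. It assumes the common definition tf = term count / total terms in the group and idf = ln(number of groups / number of groups containing the term). The word lists are invented and all variable names are hypothetical.

```r
# Toy word lists per age stage (hypothetical data).
docs <- list(
  teenager = c("friend", "game", "game", "school"),
  adult    = c("friend", "work", "work", "family"),
  elder    = c("friend", "family", "grandson", "garden")
)

# All distinct words across the groups.
vocab <- sort(unique(unlist(docs)))

# Term counts: one row per group, one column per word.
counts <- t(sapply(docs, function(w) table(factor(w, levels = vocab))))

# tf: share of a group's words taken up by each term.
tf <- counts / rowSums(counts)

# idf: natural log of (number of groups / number of groups containing the term).
idf <- log(nrow(counts) / colSums(counts > 0))

# tf-idf: multiply each column of tf by that term's idf.
tf_idf <- sweep(tf, 2, idf, `*`)
round(tf_idf, 3)
```

A word such as "friend" that appears in every group gets an idf of zero, so it drops out even though it is frequent, while a word confined to one group scores high.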

Teenager focus

The above plot is the word cloud of teenagers’ happy moments. Friend and playing are the top two words, which means teenagers mostly feel happy when they are with their friends and when they are playing.

Adult focus

The above plot is the word cloud of adults’ happy moments. Friend and work are the top two words, which means adults usually feel happy when they are with their friends or at work.

Elder focus

The above plot is the word cloud of elder people’s happy moments. Here, friend still contributes a lot to their happy moments, but family members, such as wife, daughter and son, also appear a lot.

From the above, we can see that people in different age stages have different happy moments.

In terms of the average number of words used to describe a happy moment, the age stages rank as elder > adult > teenager, which I think is caused by different life experience: older people have more life experience and have gone through more things, so they tend to use more words when they describe their happy moments. As for what makes them happy, the word clouds show that friends are a very important cause of happiness for everyone. For elder people, family contributes a lot to their happy moments. For teenagers and adults, besides family, playing also contributes a lot.

Different Marital Status

In this part, I look for differences in happy moments between people of different marital status. Here I simply divide marital status into two categories: married and single.

The above plot shows the proportion of each happy-moment category for married and single people. Basically, they are the same.

Number of words used to describe happy moments

The above plot shows the number of words married and single people used to describe their happy moments. Married people tend to use more words.

Analyzing word and document frequency: tf-idf

The tf-idf plot. It shows the words that are important for, but not too common across, married and single people’s happy moments.

What makes people in different marital status happy.

Married people.

From the above word cloud, we can see that married people’s happy moments focus on their family and family members.

Single people.

From the above, we can see that married people tend to use more words to describe their happy moments. This is quite understandable, since married people’s lives are in some ways more colorful than single people’s, because they have a family. As for what makes people happy, single people’s happy moments focus on friends and work, whereas for married people, family and family members contribute much more to their happy moments.

Different gender

In this part, I try to find what makes men and women happy, and what the differences between them are.

The above plot shows the proportion of each happy-moment category for men and women. Basically, they are the same.

The number of words men and women used to describe their happy moments is basically the same. The average for women is a little higher, which I find understandable, since women’s lives are very colorful and they have more feelings about their lives.

Analyzing word and document frequency: tf-idf

The tf-idf plot. It shows the words that are important for, but not too common across, men’s and women’s happy moments.

What makes them happy

Male

From the above word cloud, we can see that work, friend, girlfriend and wife contribute a lot to men’s happy moments.

Female

From the above, we can see that family members, friends and work contribute a lot to women’s happy moments.

From the above analysis, we can see that the happy moments of men and women are not much different. Friends and spouses contribute a lot to both.

Topic Modeling

Build corpus

Model calculation

## [1] 5000 4551

The original dataset has predicted_category, a classification of happy moments by the dataset authors. Here I use the top 10 words of each topic to name the topics “friend”, “job”, “affection”, “leisure”, “achievement”, “family”, “trip” and “bonding”.
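Naming topics from their top words amounts to sorting each topic's term weights and reading off the first few terms. The sketch below uses a tiny invented topic-by-term matrix; `beta` and `top_terms` are hypothetical names, and in practice the weights would come from whatever fitted topic model is used.

```r
# Hypothetical topic-by-term weights; real values would come from a fitted topic model.
beta <- matrix(
  c(0.50, 0.30, 0.05, 0.05, 0.10,
    0.05, 0.05, 0.45, 0.35, 0.10),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic1", "topic2"),
                  c("friend", "dinner", "job", "boss", "day"))
)

# For each topic (row), return the names of the n highest-weighted terms.
top_terms <- function(beta, n = 2) {
  apply(beta, 1, function(w) names(sort(w, decreasing = TRUE))[seq_len(n)])
}

tt <- top_terms(beta, n = 2)
tt
```

Each column of the result lists one topic's top terms, which can then be read and given a human label such as “friend” or “job”.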

Tests

We select 3 examples to visualize topics.

## $`1`
## [1] "supervisor complimented job received recognition"
## 
## $`30`
## [1] "appreciated academic performance head mechanical department"
## 
## $`120`
## [1] "surprise significant gift time reaction"

From the bar chart, we can see the probability of each example under each topic. A longer bar implies a higher probability for that topic. The 1st example is classified as job, which is reasonable. The 2nd example is classified as achievement, which is also reasonable, since the text is about academic performance. The third example is a little hard to classify, and judging from the text, that seems fair.
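Reading the most likely topic off such probabilities can be sketched as below. The numbers are invented for illustration and are not the model's output for these three examples; `gamma`, `best_topic` and `margin` are hypothetical names. The gap between the two largest probabilities gives a rough measure of how hard a document is to classify.

```r
# Hypothetical document-by-topic probabilities (each row sums to 1).
gamma <- matrix(
  c(0.70, 0.20, 0.10,
    0.15, 0.75, 0.10,
    0.35, 0.30, 0.35),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "30", "120"), c("job", "achievement", "affection"))
)

# Most probable topic per document (ties go to the first topic listed).
best_topic <- colnames(gamma)[apply(gamma, 1, which.max)]

# Gap between the two largest probabilities; a small gap means an ambiguous document.
margin <- apply(gamma, 1, function(p) {
  s <- sort(p, decreasing = TRUE)
  s[1] - s[2]
})

data.frame(doc = rownames(gamma), best_topic, margin)
```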

Summary

By analyzing the happy moments in the HappyDB dataset, we could get the following results.